An Exploratory Data and Network Analysis of Movies

Introduction

In this report we will be analysing a dataset from Kaggle, which contains movies of many genres produced over a wide span of years. What makes this analysis interesting is that we can try to draw various conclusions based on a movie’s popularity, the directors or actors involved, the year of production, and so forth. Moreover, we can construct various networks in an attempt to find meaningful and interesting results. When inspecting a database of films from recent years, various interesting inferences are uncovered: a film may have a high rating yet a low return on investment (ROI). Which genre would you guess is the most successful? Which actors do you think are the most popular?

We have split our Exploratory Data Analysis into four main parts:

Section
1 Introducing the Data
- We first try to understand the data and look at its content.
2 Pre-Processing
- We look at what needs to be altered or removed from the dataset.
- We try to clean any dirty text.
- We try to minimise the dataset’s missing values.
3 Exploring the Data
- We conduct basic analysis on the dataset.
- We explore genres.
- We explore movie popularity.
- We look at profit, gross, and return on investment for movies.
- We conduct more advanced analysis on the dataset.
4 Network Analysis
- We measure the network (centrality, degree distribution, number of components, average degree)
- We use network measures to highlight certain nodes (actors) and see which measures of an actor will increase ratings and budgets.

Admin

Before we start, let’s keep this code chunk for importing the correct libraries and loading the appropriate dataset. We use pacman to load the following:

We import the dataset like this:
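The original setup chunk is not shown in this rendering; a minimal sketch of what it amounts to, assuming the CSV is named movie_metadata.csv and sits in the working directory, and with the package list inferred from the analyses that follow:

```r
# Load (and, if necessary, install) the packages used throughout this report.
# The exact package list is an assumption based on the later sections.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, igraph, syuzhet, VIM, GGally)

# Import the Kaggle dataset (file name assumed).
movie_metadata <- read.csv("movie_metadata.csv", stringsAsFactors = FALSE)
```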

In the next section we introduce our dataset and look at its content.


Introducing The Dataset

This section of the report is quite essential for our analysis. We cannot make any interesting inferences from the dataset if we do not know what is contained within it. In this section we will try to understand exactly what we are dealing with. Thereafter, we can begin to draw interesting results.

The movie_metadata dataset contains 28 unique columns/variables, each of which is described in the table below:

Variable Name Description
color Specifies whether a movie is in black and white or color
director_name Contains name of the director of a movie
num_critic_for_reviews Contains number of critic reviews per movie
duration Contains duration of a movie in minutes
director_facebook_likes Contains number of facebook likes for a director
actor_3_facebook_likes Contains number of facebook likes for actor 3
actor_2_name Contains name of 2nd leading actor of a movie
actor_1_facebook_likes Contains number of facebook likes for actor 1
gross Contains the amount a movie grossed in USD
genres Contains the sub-genres to which a movie belongs
actor_1_name Contains name of the actor in lead role
movie_title Title of the Movie
num_voted_users Contains number of users votes for a movie
cast_total_facebook_likes Contains number of facebook likes for the entire cast of a movie
actor_3_name Contains the name of the 3rd leading actor of a movie
facenumber_in_poster Contains number of actors' faces on a movie poster
plot_keywords Contains key plot words associated with a movie
movie_imdb_link Contains the link to the imdb movie page
num_user_for_reviews Contains the number of user generated reviews per movie
language Contains the language of a movie
country Contains the name of the country in which a movie was made
content_rating Contains maturity rating of a movie
budget Contains the amount of money spent in production per movie
title_year Contains the year in which a film was released
actor_2_facebook_likes Contains number of facebook likes for actor 2
imdb_score Contains user generated rating per movie
aspect_ratio Contains the size of the aspect ratio of a movie
movie_facebook_likes Number of likes of the movie on its Facebook Page

Furthermore, the dataset contains 5043 movies, spanning 96 years and 46 countries. There are 1693 unique director names and 5390 unique actors/actresses. Around 79% of the movies are from the USA, 8% from the UK, and 13% from other countries.

The structure of the dataset can also be used to understand our data. We can run the following code chunk to see its structure.
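The chunk itself is omitted in this rendering; it amounts to a single call:

```r
# Print the structure of the data frame: one line per variable,
# showing its type and the first few values.
str(movie_metadata)
```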

## 'data.frame':    5043 obs. of  28 variables:
##  $ color                    : chr  "Color" "Color" "Color" "Color" ...
##  $ director_name            : chr  "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
##  $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
##  $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
##  $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
##  $ actor_2_name             : chr  "Joel David Moore" "Orlando Bloom" "Rory Kinnear" "Christian Bale" ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
##  $ genres                   : chr  "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy" "Action|Adventure|Thriller" "Action|Thriller" ...
##  $ actor_1_name             : chr  "CCH Pounder" "Johnny Depp" "Christoph Waltz" "Tom Hardy" ...
##  $ movie_title              : chr  "Avatar " "Pirates of the Caribbean: At World's End " "Spectre " "The Dark Knight Rises " ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
##  $ actor_3_name             : chr  "Wes Studi" "Jack Davenport" "Stephanie Sigman" "Joseph Gordon-Levitt" ...
##  $ facenumber_in_poster     : int  0 0 1 0 0 1 0 1 4 3 ...
##  $ plot_keywords            : chr  "avatar|future|marine|native|paraplegic" "goddess|marriage ceremony|marriage proposal|pirate|singapore" "bomb|espionage|sequel|spy|terrorist" "deception|imprisonment|lawlessness|police officer|terrorist plot" ...
##  $ movie_imdb_link          : chr  "http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1" ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 NA 738 1902 387 1117 973 ...
##  $ language                 : chr  "English" "English" "English" "English" ...
##  $ country                  : chr  "USA" "USA" "UK" "USA" ...
##  $ content_rating           : chr  "PG-13" "PG-13" "PG-13" "PG-13" ...
##  $ budget                   : num  237000000 300000000 245000000 250000000 NA ...
##  $ title_year               : int  2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 12 632 11000 553 21000 11000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 0 24000 0 29000 118000 10000 ...

In the next section we can start preparing the dataset for analysis by removing and simplifying some of the data.


Pre-Processing Data

In this part of the report we attempt to look for various things that may have a negative or significant impact on the inferences we make on the dataset. Once we have sufficiently cleaned and prepared the dataset, we can commence with drawing various conclusions from the graphs we generate.

Duplicate Rows

In the movie_metadata dataset we find that there are 45 duplicated rows, which need to be removed so that only the unique rows are kept.
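A sketch of the de-duplication, assuming the data frame is called movie_metadata:

```r
library(dplyr)

# Count fully duplicated rows (the value printed below),
# then keep only the unique rows.
sum(duplicated(movie_metadata))
movie_metadata <- distinct(movie_metadata)
```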

## [1] 45

Missing Values

Let’s have a look at the number of NA values in our dataset:
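The counting code is not shown; one common way to produce a per-column tally like the one below:

```r
# Number of missing values in each column of the dataset.
colSums(is.na(movie_metadata))
```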

##                     color             director_name 
##                         0                         0 
##    num_critic_for_reviews                  duration 
##                        49                        15 
##   director_facebook_likes    actor_3_facebook_likes 
##                       103                        23 
##              actor_2_name    actor_1_facebook_likes 
##                         0                         7 
##                     gross                    genres 
##                       874                         0 
##              actor_1_name               movie_title 
##                         0                         0 
##           num_voted_users cast_total_facebook_likes 
##                         0                         0 
##              actor_3_name      facenumber_in_poster 
##                         0                        13 
##             plot_keywords           movie_imdb_link 
##                         0                         0 
##      num_user_for_reviews                  language 
##                        21                         0 
##                   country            content_rating 
##                         0                         0 
##                    budget                title_year 
##                       487                       107 
##    actor_2_facebook_likes                imdb_score 
##                        13                         0 
##              aspect_ratio      movie_facebook_likes 
##                       327                         0

To help visualise this, have a look at the following heatmap of the missing values:
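The plotting chunk is omitted; a sketch using the VIM package, whose aggr() function both draws the missingness plot and prints the "Variables sorted by number of missings" table seen below:

```r
library(VIM)

# Plot the missingness pattern; sortVars = TRUE also prints the
# proportion of missing values per variable.
aggr(movie_metadata, prop = TRUE, numbers = TRUE, sortVars = TRUE)
```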

## 
##  Variables sorted by number of missings: 
##                   Variable       Count
##                      gross 0.174869948
##                     budget 0.097438976
##               aspect_ratio 0.065426170
##                 title_year 0.021408563
##    director_facebook_likes 0.020608243
##     num_critic_for_reviews 0.009803922
##     actor_3_facebook_likes 0.004601841
##       num_user_for_reviews 0.004201681
##                   duration 0.003001200
##       facenumber_in_poster 0.002601040
##     actor_2_facebook_likes 0.002601040
##     actor_1_facebook_likes 0.001400560
##                      color 0.000000000
##              director_name 0.000000000
##               actor_2_name 0.000000000
##                     genres 0.000000000
##               actor_1_name 0.000000000
##                movie_title 0.000000000
##            num_voted_users 0.000000000
##  cast_total_facebook_likes 0.000000000
##               actor_3_name 0.000000000
##              plot_keywords 0.000000000
##            movie_imdb_link 0.000000000
##                   language 0.000000000
##                    country 0.000000000
##             content_rating 0.000000000
##                 imdb_score 0.000000000
##       movie_facebook_likes 0.000000000

Gross and Budget

Since gross and budget have too many missing values (874 and 487 respectively), and we want to keep these two variables for the analysis that follows, we simply delete the rows with missing gross or budget values; imputation would not do a good job here.
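A sketch of the row removal with dplyr (data frame name assumed):

```r
library(dplyr)

# Keep only the rows where both gross and budget are known,
# then check the remaining dimensions.
movie_metadata <- movie_metadata %>%
  filter(!is.na(gross), !is.na(budget))
dim(movie_metadata)
```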

## [1] 3857   28

The number of observations has decreased by 4998 - 3857 = 1141, which is luckily only 22.8% of the previous total.

Content Rating

The dataset contains a vast range of content rating, which can be seen below:

## 
##            Approved         G        GP         M     NC-17 Not Rated 
##        51        17        91         1         2         6        42 
##    Passed        PG     PG-13         R   Unrated         X 
##         3       573      1314      1723        24        10

We find that M = GP = PG and X = NC-17, so let’s replace M and GP with PG, and X with NC-17, since these are the labels in use today.

We want to replace Approved, Not Rated, Passed, Unrated with the most common rating R.
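A sketch of both recodings with dplyr (data frame and column names as in the dataset description):

```r
library(dplyr)

movie_metadata <- movie_metadata %>%
  mutate(content_rating = recode(content_rating,
    "M"  = "PG", "GP" = "PG",      # historical equivalents of PG
    "X"  = "NC-17",                # X was renamed NC-17
    "Approved"  = "R", "Not Rated" = "R",
    "Passed"    = "R", "Unrated"   = "R"))

# Inspect the simplified rating levels.
table(movie_metadata$content_rating)
```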

## 
##           G NC-17    PG PG-13     R 
##    51    91    16   576  1314  1809

Blank values should be treated as missing. Since these missing values cannot be replaced with reasonable data, we delete those rows.

Delete (Some) Rows

Let’s now have a look at how many complete cases we have.

##                     color             director_name 
##                         0                         0 
##    num_critic_for_reviews                  duration 
##                         1                         0 
##   director_facebook_likes    actor_3_facebook_likes 
##                         0                         6 
##              actor_2_name    actor_1_facebook_likes 
##                         0                         1 
##                     gross                    genres 
##                         0                         0 
##              actor_1_name               movie_title 
##                         0                         0 
##           num_voted_users cast_total_facebook_likes 
##                         0                         0 
##              actor_3_name      facenumber_in_poster 
##                         0                         6 
##             plot_keywords           movie_imdb_link 
##                         0                         0 
##      num_user_for_reviews                  language 
##                         0                         0 
##                   country            content_rating 
##                         0                         0 
##                    budget                title_year 
##                         0                         0 
##    actor_2_facebook_likes                imdb_score 
##                         2                         0 
##              aspect_ratio      movie_facebook_likes 
##                        55                         0

We remove aspect_ratio because (1) it has a lot of missing values and (2) we will not be looking into its impact on the other variables (we assume it has none).

Add a Column

Gross and Budget

We have gross and budget information. So let’s add two columns: profit and percentage return on investment for further analysis.
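A sketch of the two derived columns, with the new column names being our own choice:

```r
library(dplyr)

# Profit in USD, and percentage return on investment relative to budget.
movie_metadata <- movie_metadata %>%
  mutate(profit     = gross - budget,
         roi_percent = (profit / budget) * 100)
```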

Remove (Some) Columns

Colour

Next, we take a look at the influence of colour vs black and white.

## 
##                   Black and White            Color 
##                2              124             3680

Since only 3.4% of the movies are in black and white, the variable is nearly constant, so we can remove the color column.

Language

Let’s have a look at the different languages contained within the dataset.

## 
##            Aboriginal     Arabic    Aramaic    Bosnian  Cantonese 
##          2          2          1          1          1          7 
##      Czech     Danish       Dari      Dutch    English   Filipino 
##          1          3          2          3       3644          1 
##     French     German     Hebrew      Hindi  Hungarian Indonesian 
##         34         11          2          5          1          2 
##    Italian   Japanese     Kazakh     Korean   Mandarin       Maya 
##          7         10          1          5         14          1 
##  Mongolian       None  Norwegian    Persian Portuguese   Romanian 
##          1          1          4          3          5          1 
##    Russian    Spanish       Thai Vietnamese       Zulu 
##          1         24          3          1          1

Almost 95% of the movies are in English, which means this variable is nearly constant. Let’s remove it.

Country

Next, we can look at the different countries represented in the dataset.

## 
##    Afghanistan      Argentina          Aruba      Australia        Belgium 
##              1              3              1             40              1 
##         Brazil         Canada          Chile          China       Colombia 
##              5             63              1             13              1 
## Czech Republic        Denmark        Finland         France        Georgia 
##              3              9              1            103              1 
##        Germany         Greece      Hong Kong        Hungary        Iceland 
##             79              1             13              2              1 
##          India      Indonesia           Iran        Ireland         Israel 
##              5              1              4              7              2 
##          Italy          Japan         Mexico    Netherlands       New Line 
##             11             15             10              3              1 
##    New Zealand         Norway  Official site           Peru    Philippines 
##             11              4              1              1              1 
##         Poland        Romania         Russia   South Africa    South Korea 
##              1              2              3              3              8 
##          Spain         Taiwan       Thailand             UK            USA 
##             22              2              4            316           3025 
##   West Germany 
##              1

Around 79% of the movies are from the USA, 8% from the UK, and 13% from other countries, so we group the remaining countries together to give this categorical variable fewer levels: USA, UK, Others.
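A sketch of the grouping with dplyr:

```r
library(dplyr)

# Collapse every country other than USA and UK into a single "Others" level.
movie_metadata <- movie_metadata %>%
  mutate(country = case_when(
    country == "USA" ~ "USA",
    country == "UK"  ~ "UK",
    TRUE             ~ "Others"))

table(movie_metadata$country)
```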

## 
## Others     UK    USA 
##    465    316   3025

Now that we’ve cleaned up our dataset, we can continue to explore the data even further! In the next section we will be looking at genres, movie popularity, gross, profit, and many other aspects pertinent to our data.


Analysing Data

When inspecting a dataset of movies over the past few years, various interesting inferences can be uncovered. A movie may have a high rating yet low return on investment. Which genre is the most successful? Which actors are the most popular? These are some of the questions we aim to answer in this section.

We can start by performing basic analysis on our data. Thereafter, we delve a bit deeper into more specific parts of the dataset, in hopes of uncovering interesting observations.

Basic Analysis

Let’s first have a look at the number of movies that are produced over the years.

From the graph, we see there aren’t many records of movies released before 1980. It’s better to remove those records because they might not be representative of the data.

Let’s have a look at the movie counts now:

The graph above illustrates the number of movies released in the period 1980 - 2016. As we can see, from the 1980s onward the number of movies released rose rapidly, almost exponentially.

Movie Genre Analysis

Now we can delve into more specific things regarding movies, like genres.

Top Genres

From the above, a combination of Comedy, Romance, and Drama appears to be by far the most frequently produced set of genres. As you can see, movies are associated with multiple genres. For analysis purposes, we choose to use the first word in the genre column, as this is likely to be the most accurate description of the movie.

Split Genres

Here we first split the genres into multiple columns and merge them together.

## [1] "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy"       
## [3] "Action|Adventure|Thriller"       "Action|Thriller"                
## [5] "Action|Adventure|Sci-Fi"         "Action|Adventure|Romance"

Let’s split the genres separated by “|” into 8 different columns.
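A sketch of the split using tidyr's separate(); the genre_1 to genre_8 column names are our own choice:

```r
library(tidyr)

# Split the pipe-separated genres into (up to) 8 columns;
# movies with fewer genres are padded with NA on the right.
genre_df <- separate(movie_metadata, genres,
                     into = paste0("genre_", 1:8),
                     sep = "\\|", fill = "right")
```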

genre_df consists of 8 columns, each with different genres. Let’s have a look at the frequency of ALL the genres.

It is evident that the Drama and Comedy genres are still the most frequently produced. It is also interesting to note that Romance sits lower on this list than in the previous graph. This may imply that Romance mostly co-occurs with Comedy or Drama; it does not co-occur with the other genres as frequently as Comedy and Drama do. Additionally, the fact that Comedy and Drama occur the most does not necessarily mean that they are the most profitable or return the most successful ROIs. We will explore this later in the report.

Previously we assumed that the first genre is the most applicable, therefore, we choose the first column as the genre for the movie and append it to the dataframe.

How does this distribution look like over the years? Lets have a look at the frequency of genres between the period of 1980 and 2016.

We can make one or two remarks from this heatmap. Firstly, we can see that Action, Adventure, Comedy, and Drama are the genres predominantly used as the first term associated with a movie. Secondly, we can see that Romance, which was previously high in frequency for co-occurring with Comedy and Drama, is now very low. This means there are very few movies that are predominantly Romance; Romance is mostly the second, third, etc. term used to describe a movie.

Additionally, we can also see that some genres, like Action and Comedy have picked up over the years. This is evident by looking at the darker shades of blue becoming more prominent in the latter years.

Popularity Analysis

IMDB ratings VS Movie Count

Let’s have a look at the IMDB rating distribution over the number of movies produced. Below we can see that the data is slightly skewed to the left, with a concentration around a score of 6 out of 10 and a long tail to the left. The vast majority of movies are given a score between 5 and 7.5, with fewer movies scoring higher than that.

The lowest score, 1.6, belongs to Justin Bieber: Never Say Never, whereas the highest, 9.3, belongs to The Shawshank Redemption. The mean of the IMDB scores is 6.433288.

Facebook Likes VS IMDB Score

There are three things to look at in this graph: average IMDB score, average Facebook likes, and the number of movies rated per content rating. We can see that all movies have (on average) very similar IMDB scores; however, they differ highly in the number of Facebook likes. For example, movies rated PG-13 receive many more Facebook likes (on average) than movies rated NC-17. Movies rated R receive (on average) more likes than movies rated NC-17, but still have relatively similar IMDB scores.

IMDB Scores VS Facebook Likes

We can infer a strong correlation between a movie’s Facebook likes and its IMDB score. This is expected, as a higher rating relates to higher viewer satisfaction, and hence we expect to see an increase in positive online presence. Initially, this graph was constructed to see if there’d be a difference between viewer enjoyment and movie rating. Movie databases are often criticised for the nature of their rating scales, driven by critics and prioritising sentiment and plot, which may not fully coincide with viewer enjoyment. However, as seen below, this is not the case with IMDB’s scoring.

Top 20 directors with highest average IMDB score

Let’s take a look at the directors. We can see that the top IMDB rated directors have very similar scores (8.1 - 8.6). Tony Kaye has the highest rating of 8.6.

director_name avg_imdb
Tony Kaye 8.600000
Damien Chazelle 8.500000
Majid Majidi 8.500000
Ron Fricke 8.500000
Christopher Nolan 8.425000
Asghar Farhadi 8.400000
Marius A. Markevicius 8.400000
Richard Marquand 8.400000
Sergio Leone 8.400000
Lee Unkrich 8.300000
Lenny Abrahamson 8.300000
Pete Docter 8.233333
Hayao Miyazaki 8.225000
Joshua Oppenheimer 8.200000
Juan José Campanella 8.200000
Quentin Tarantino 8.200000
David Sington 8.100000
Je-kyu Kang 8.100000
Terry George 8.100000
Tim Miller 8.100000

Vote Counts VS IMDB score

From the above scatter plot, it is evident that the majority of the movie ratings are clustered at 7.5 points.

Average score, avg votes and, avg user reviews

Each line represents an average: of the IMDB score, the number of votes, and the number of user reviews.

Profit | Gross | Return on Investment

In the above graph, it is evident that a higher budget does not necessarily translate into a higher gross profit.

Most Successful Directors based on Profit

Looking at the most successful directors, one can see that they either produced a single highly profitable movie (e.g. Tim Miller) or created an array of successful films with large budgets, such as James Cameron.

Top 20 movies based on its Profit

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

When assessing the top 20 movies based on profit, Avatar has the highest profit, sitting in a similar region of the plot to its director, James Cameron, in the previous chart.

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

As we can see from the above graph, even though Comedy and Drama are the most frequently produced genres, overall they are not the most profitable. The budgets of Western and Thriller movies are far smaller than those of other genres.

Top 20 movies based on its Return on Investment

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Successful directors such as George Lucas also have highly profitable movies.

Further Analysis

Correlation Heatmap

## Warning in ggcorr(movie_metadata, label = TRUE, label_round = 2, label_size
## = 2.8, : data in column(s) 'director_name', 'actor_2_name', 'actor_1_name',
## 'movie_title', 'actor_3_name', 'plot_keywords', 'movie_imdb_link',
## 'language', 'country', 'content_rating', 'genre', 'vote_bucket' are not
## numeric and were ignored

Based on the heatmap, we can see some high correlations (greater than 0.7) between predictors.

Given the highest correlation value of 0.95, we find that actor_1_facebook_likes is highly correlated with cast_total_facebook_likes, and the likes of actor 2 and actor 3 are also correlated with the total. So we combine them into two variables: actor_1_facebook_likes and other_actors_facebook_likes.
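A sketch of that combination; dropping cast_total_facebook_likes alongside the two supporting-actor columns is our assumption about how the redundancy is removed:

```r
library(dplyr)

# Collapse the supporting actors' likes into one variable to reduce
# collinearity with cast_total_facebook_likes, then drop the originals.
movie_metadata <- movie_metadata %>%
  mutate(other_actors_facebook_likes =
           actor_2_facebook_likes + actor_3_facebook_likes) %>%
  select(-actor_2_facebook_likes, -actor_3_facebook_likes,
         -cast_total_facebook_likes)
```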

There are high correlations among num_voted_users, num_user_for_reviews and num_critic_for_reviews.

Sentiment Analysis

Exploration into Movie Plot Keywords

Keywords from movie plot lines are given in a single column, split by a “|”. The following code will separate these words into separate columns.

## Selecting by n


Above we can see the most popular keywords describing the films in the dataset.

The following code checks to see which movies contain common keywords.

##                movie_title                                 plot_keywords
## 1                The Doors death|paris france|rock band|singer|the doors
## 2 For a Good Time, Call...           friend|friendship|gay|phone sex|sex


Here are two movies that contain at least one common plot keyword.

##              movie_title
## 1 The Mothman Prophecies
## 2     The Girl Next Door
##                                                                  plot_keywords
## 1 car accident|death of wife|mothman|point pleasant west virginia|urban legend
## 2  forced to strip|male rear nudity|porn actress|scantily clad female|teenager

Here are two movies that do not contain any of the common plot key words.


This graph compares the average gross of movies that contain at least one of the top 20 most common plot keywords with that of movies containing none of these popular keywords.

The average gross for movies that do not contain one of the top 20 most common keywords appears to be higher.

Calculating Sentiment

Below is where the sentiment of the keywords is calculated. The sentiment function used comes from the syuzhet package; it can detect the presence of eight different emotions, namely “anger”, “anticipation”, “disgust”, “fear”, “joy”, “sadness”, “surprise”, and “trust”, and can also calculate positive and negative valence.
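A minimal sketch of that calculation, assuming the keywords are scored after replacing the “|” separators with spaces:

```r
library(syuzhet)

# get_nrc_sentiment() scores text against the NRC emotion lexicon:
# eight emotions plus positive/negative valence, one row per input string.
keywords   <- gsub("\\|", " ", movie_metadata$plot_keywords)
sentiments <- get_nrc_sentiment(keywords)
head(sentiments)
```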

Sentiment results

As previously mentioned, the number of movies increases dramatically from 1980 onwards, so the calculated sentiments are filtered to reduce the sensitivity of the data.


What is immediately visible is that the variation between the types of emotions in plot keywords has reduced over the years.

Below the trend of positive and negative sentiment is explored.


Generally speaking, the sentiment of the common keywords has been more positive than negative, and has remained relatively constant, apart from a decrease in both positive and negative sentiment from around 1995 to 2010.


Network Analysis

Constructing the network graph

Actors will be the nodes. Edges exist only if the actors have appeared in a movie together.

##      actor_1_name     actor_2_name         actor_3_name
## 1     CCH Pounder Joel David Moore            Wes Studi
## 2     Johnny Depp    Orlando Bloom       Jack Davenport
## 3 Christoph Waltz     Rory Kinnear     Stephanie Sigman
## 4       Tom Hardy   Christian Bale Joseph Gordon-Levitt
## 5     Doug Walker       Rob Walker

The nodelist will only contain each actor’s name once.

##             value
## 1     CCH Pounder
## 2     Johnny Depp
## 3 Christoph Waltz
## 4       Tom Hardy
## 5    Daryl Sabara

Because each movie has three top actors given, some column manipulation is needed to format the data into the two “to” and “from” columns required for the edgelist.
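A sketch of that manipulation with dplyr, stacking the three possible pairings per movie into one two-column edgelist:

```r
library(dplyr)

# Pair the three billed actors of each movie: (1,2), (1,3), and (2,3),
# then stack the pairings into a single "from"/"to" data frame.
edges <- bind_rows(
  movie_metadata %>% select(from = actor_1_name, to = actor_2_name),
  movie_metadata %>% select(from = actor_1_name, to = actor_3_name),
  movie_metadata %>% select(from = actor_2_name, to = actor_3_name))

head(edges)
```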

##              from               to
## 1     CCH Pounder Joel David Moore
## 2     Johnny Depp    Orlando Bloom
## 3 Christoph Waltz     Rory Kinnear
## 4       Tom Hardy   Christian Bale
## 5    Daryl Sabara  Samantha Morton
##              from                   to
## 1     CCH Pounder            Wes Studi
## 2     Johnny Depp       Jack Davenport
## 3 Christoph Waltz     Stephanie Sigman
## 4       Tom Hardy Joseph Gordon-Levitt
## 5    Daryl Sabara         Polly Walker
##               from                   to
## 1 Joel David Moore            Wes Studi
## 2    Orlando Bloom       Jack Davenport
## 3     Rory Kinnear     Stephanie Sigman
## 4   Christian Bale Joseph Gordon-Levitt
## 5  Samantha Morton         Polly Walker
##              from               to
## 1     CCH Pounder Joel David Moore
## 2     Johnny Depp    Orlando Bloom
## 3 Christoph Waltz     Rory Kinnear
## 4       Tom Hardy   Christian Bale
## 5    Daryl Sabara  Samantha Morton
## 6    J.K. Simmons     James Franco

Here is a simple network plot of the constructed network.
Here a densely connected core can be seen, surrounded by many small components that are not connected to the main component.

Network measures

Eigenvector centrality (also called eigencentrality) is a measure of the influence of a node in a network. It assigns relative scores to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes.
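A sketch of the computation with igraph, assuming the actor network has been built into a graph object called g:

```r
library(igraph)

# Eigenvector centrality score for every actor in the network;
# 'g' is the actor graph built from the edgelist (name assumed).
eigen_scores <- eigen_centrality(g)$vector
```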

## Warning in vattrs[[name]][index] <- value: number of items to replace is
## not a multiple of replacement length


The eigenvector centrality scores are negatively skewed for global community 1. This means that most nodes are not connected to high-scoring nodes.

In a connected graph, the normalized closeness centrality (or closeness) of a node is based on the average length of the shortest paths between the node and all other nodes in the graph. Thus the more central a node is, the closer it is to all other nodes. (adapted from Wikipedia)

An actor is well connected if many other actors can be reached in a small number of hops.


The closeness distribution is very interesting: there are many nodes with relatively high closeness and many with relatively low closeness. This is due to the graph having many small components and one very densely connected large component. The mean value is 0.0000001.

Interpretively, the Bonacich power measure corresponds to the notion that the power of a vertex is recursively defined by the sum of the power of its alters. The nature of the recursion involved is then controlled by the power exponent: positive values imply that vertices become more powerful as their alters become more powerful (as occurs in cooperative relations), while negative values imply that vertices become more powerful only as their alters become weaker (as occurs in competitive or antagonistic relations). (adapted from Wikipedia)

Essentially, the importance of an actor is defined by the importance of alters, or other connected actors.


So the distribution of Bonacich power is slightly positively skewed, meaning that in general vertices are considered more ‘powerful’ as their alters increase in power. The maximum power centrality is 14.7905526.

Our PageRank calculation ignores edge weights when scoring nodes. The more likely an actor is to be reached by a random walk across co-appearances, the higher the assigned PageRank.


Most nodes have a relatively low PageRank. The mean PageRank is 0.0001859.
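Consistent with the unweighted treatment above, a stdlib-Python sketch of PageRank on a hypothetical three-node graph (the report's own computation is in R):

```python
# Hypothetical directed graph: node -> list of out-neighbours
graph = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A", "B"],
}

def pagerank(graph, damping=0.85, iters=100):
    # Unweighted PageRank: each node splits its rank evenly among its
    # out-neighbours; the damping factor models random restarts.
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in graph}
        for v, neighbours in graph.items():
            share = rank[v] / len(neighbours)
            for u in neighbours:
                new[u] += damping * share
        rank = new
    return rank

pr = pagerank(graph)
# Ranks sum to 1; "A" collects rank from both B and C and comes out on top.
print(max(pr, key=pr.get))  # A
```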

Community Detection

The Louvain grouping optimises for modularity in the network and therefore tries to create densely connected clusters with sparse connections between them. The densely connected core may not allow for very modular communities to be identified.
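The Louvain algorithm itself is involved, but the modularity score it maximises is simple to state: the fraction of edges inside communities minus the fraction expected from node degrees alone. A stdlib-Python sketch on a hypothetical five-node graph (a triangle with a two-node tail):

```python
# Modularity Q for a given partition -- the quantity Louvain maximises:
#   Q = (1/2m) * sum over node pairs i,j in the same community of
#       (A_ij - k_i * k_j / 2m)
graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}

def modularity(graph, community):
    two_m = sum(len(nbrs) for nbrs in graph.values())  # 2 * edge count
    q = 0.0
    for i in graph:
        for j in graph:
            if community[i] != community[j]:
                continue
            a_ij = 1 if j in graph[i] else 0
            q += a_ij - len(graph[i]) * len(graph[j]) / two_m
    return q / two_m

# Splitting the triangle {A,B,C} from the tail {D,E} scores higher than
# putting everything in a single community (which always gives Q = 0).
good = modularity(graph, {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1})
bad = modularity(graph, {v: 0 for v in graph})
print(good > bad)  # True
```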

Distribution of community size


This graph shows the distribution of community sizes. The sizes are roughly exponentially distributed, resulting in a few large communities and many smaller ones. Some form of filtering on community size is needed to remove the smaller communities.


After removing communities smaller than 100 actors, only 17 communities remain.
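The filtering step can be sketched as follows (toy membership data with a threshold of 2; the report's threshold is 100):

```python
from collections import Counter

# Hypothetical community assignment: actor -> community id
membership = {"a1": 1, "a2": 1, "a3": 2, "a4": 1, "a5": 3, "a6": 2}

MIN_SIZE = 2  # the report uses 100

# Count members per community, keep only the sufficiently large ones,
# then drop actors whose community was filtered out.
sizes = Counter(membership.values())
keep = {comm for comm, n in sizes.items() if n >= MIN_SIZE}
filtered = {actor: c for actor, c in membership.items() if c in keep}
print(sorted(keep))  # [1, 2]
```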


Visualising the network

The 17 remaining communities were analysed in Gephi. The nodes are coloured and grouped by community, while the size of each node and its label depends on the node's degree.


It is clear that the remaining communities are very densely connected, meaning that even after optimising for modularity, actors have many connections outside their community. These dense connections may have negatively impacted the Louvain results, and there is some doubt as to the true modularity of these communities.

## # A tibble: 3,890 x 7
##    name        g_e_values g_close_cent g_power_cent g_page_rank  comm     n
##    <chr>            <dbl>        <dbl>        <dbl>       <dbl> <int> <int>
##  1 Adam Brown       41.7   0.000000142       1.94     0.0000394     1   535
##  2 Adam Hicks       -2.70  0.000000142      -0.0684   0.000130      1   535
##  3 Adrian Alo…      -1.    0.000000142      -0.328    0.0000640     1   535
##  4 Adrian Les…       3.25  0.000000142       1.65     0.0000517     1   535
##  5 Aida Turtu…      -1.    0.000000142      -1.09     0.000100      1   535
##  6 Aimee Teeg…       1.23  0.000000142      -0.304    0.000153      1   535
##  7 AJ Michalka       6.78  0.000000142       0.0789   0.0000823     1   535
##  8 Al Roker         -1.    0.000000142      -0.475    0.0000559     1   535
##  9 Alan Young        4.94  0.000000142       0.232    0.0000745     1   535
## 10 Alastair D…      -2.25  0.000000142       0.611    0.0000594     1   535
## # … with 3,880 more rows

Analysis of Global Comm 1


The eigenvector centrality scores are positively skewed for global community 1, meaning that most nodes are not connected to high-scoring nodes.


The average closeness centrality is 0.0004944. When looking at a single community we expect a higher average closeness than when calculating for the whole graph, which was 0.0000001.


The average Bonacich power centrality is 6.653664.


The local mean page rank is 0.0018692, compared to the global mean of 0.0001859.

Creating the graph of centrality measures for community 1.

Analysis of Global Comm 2

Here we create the subgraph containing only the vertices in community 2.


The eigenvector centrality scores are positively skewed for global community 2, meaning that most nodes are not connected to high-scoring nodes.


The mean closeness centrality is 0.0008507.


The mean Bonacich power centrality is -0.2881859.


The mean page rank for community 2 is 0.0021645.

Creating the graph of centrality measures for community 2.

Analysis of Global Comm 3

Here we create the subgraph containing only the vertices in community 3.


The eigenvector centrality scores for global community 3 are heavily skewed, with most nodes not connected to high-scoring nodes.


The mean closeness centrality is 0.0007851.


The mean Bonacich power centrality in community 3 is -0.0918364.


The mean page rank is 0.0022422.

Creating the graph of centrality measures for community 3.

Highlighting nodes using different measures of importance.

The top nodes from selected communities will be compared to see which measure is the best indicator of higher ratings.
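Extracting the top node per measure is a simple lookup; an illustrative stdlib-Python sketch over a hypothetical centrality table (actor names and values are made up):

```python
# Hypothetical centrality table: actor -> {measure: value}
centralities = {
    "Actor X": {"degree": 40, "closeness": 0.6, "pagerank": 0.012},
    "Actor Y": {"degree": 25, "closeness": 0.7, "pagerank": 0.008},
    "Actor Z": {"degree": 31, "closeness": 0.5, "pagerank": 0.015},
}

def top_actor(centralities, measure):
    # The actor with the largest value for the chosen measure.
    return max(centralities, key=lambda a: centralities[a][measure])

print(top_actor(centralities, "degree"))    # Actor X
print(top_actor(centralities, "pagerank"))  # Actor Z
```

Note that different measures can crown different actors, which is exactly what the comparisons below set out to examine.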

Overall measures
Highest degree
## [1] "Morgan Freeman"

Morgan Freeman has the highest degree of any node in the graph and could therefore be seen as an influential node; however, he may not be a central one. Morgan Freeman has acted with the greatest number of distinct actors according to the movies in this dataset.

Closeness centrality
## [1] "Highest global closeness centrality: Morgan Freeman"

Due to the high degree, it is not surprising that Morgan Freeman has the highest level of closeness centrality across the graph.

Highest Page Ranking
## [1] "Highest global Page rank: Morgan Freeman"

Morgan Freeman is considered the most important node by the PageRank algorithm.

Highest Bonacich Power centrality
## [1] "Highest global Boncich power centrality: Matt Keeslar"

Matt Keeslar himself is not considered the most influential actor; however, he has the most influential connections.

Community 1

Highest degree
## [1] "Rupert Everett"
Closeness centrality
## [1] "Highest global closeness centrality: Nathan Lane"
## [1] "Highest local closeness centrality: Jim Belushi"
Highest Page Ranking
## [1] "Highest global page rank: Richard Schiff"
## [1] "Highest local page rank: Richard Schiff"
Highest Bonacich Power centrality
## [1] "Highest global power centrality: Michael McGlone"
## [1] "Highest local power centrality: William Sanderson"
Community 2

Closeness centrality
## [1] "Highest global closeness centrality: Robert Duvall"
## [1] "Highest global closeness centrality: Vanessa Redgrave"
Highest Page Ranking
## [1] "Highest global Page rank: Robert Duvall"
## [1] "Highest local Page rank: Vanessa Redgrave"
Highest Bonacich Power centrality
## [1] "Highest global Boncich power centrality: Tabu"
## [1] "Highest local Boncich power centrality: Dan Fogler"

The actors Gary Coleman and Pamela Anderson have the highest global and local Bonacich power centrality, respectively. This means that, among the vertices in community 2, Gary Coleman has the most powerful connections when power is measured across the whole graph, while Pamela Anderson has the most powerful connections within community 2 itself.

Community 3

Highest degree
## [1] "Eddie Marsan"
Closeness centrality
## [1] "Highest global closeness centrality: Steve Coogan"
## [1] "Highest local closeness centrality: Eddie Marsan"
Highest Page Ranking
## [1] "Highest global Page rank: Jon Favreau"
## [1] "Highest local Page rank: Eddie Marsan"
Highest Bonacich Power centrality
## [1] "Highest global Boncich power centrality: Carter Jenkins"
## [1] "Highest local Boncich power centrality: Matthew R. Anderson"

The actors Tabu and Charlize Theron have the highest global and local Bonacich power centrality, respectively. This means that, among the vertices in community 3, Tabu has the most powerful connections when power is measured across the whole graph, while Charlize Theron has the most powerful connections within community 3 itself.

Analysis of centrality measures and ratings

Here the average rating of movies starred in for each actor is calculated.
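The aggregation is a group-by and mean; the report does this in R (dplyr), but the idea can be sketched in plain Python on hypothetical appearance data:

```python
# Hypothetical (actor, rating) pairs, one per movie appearance
appearances = [
    ("Actor X", 7.5), ("Actor X", 8.1),
    ("Actor Y", 6.0), ("Actor Y", 6.4), ("Actor Y", 7.0),
]

# Group ratings by actor, then average each group.
ratings = {}
for actor, rating in appearances:
    ratings.setdefault(actor, []).append(rating)

avg_rating = {a: sum(r) / len(r) for a, r in ratings.items()}
print(avg_rating["Actor X"])  # (7.5 + 8.1) / 2 = 7.8
```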


Overall

## # A tibble: 1 x 2
##   actor          avg_rating
##   <chr>               <dbl>
## 1 Morgan Freeman       7.76


Morgan Freeman has an average movie rating of 7.7605. In terms of the overall graph, this actor has the highest degree, closeness centrality and page rank.

## # A tibble: 1 x 2
##   actor        avg_rating
##   <chr>             <dbl>
## 1 Matt Keeslar          7


Matt Keeslar has an average movie rating of 7. In terms of the overall graph, this actor has the highest Bonacich centrality, meaning he has very influential alters.

Community 1
## # A tibble: 1 x 2
##   actor         avg_rating
##   <chr>              <dbl>
## 1 Tom Wilkinson       7.08


Tom Wilkinson has an average movie rating of 7.077083. In terms of community 1, this actor has the highest degree, global closeness centrality, and both global and local page rank.

## # A tibble: 1 x 2
##   actor              avg_rating
##   <chr>                   <dbl>
## 1 Miranda Richardson       6.86


Miranda Richardson has an average movie rating of 6.855. In terms of community 1, this actor has the highest local closeness centrality meaning she is very central within community 1 but not overall in the graph.

## # A tibble: 1 x 2
##   actor            avg_rating
##   <chr>                 <dbl>
## 1 R. Marcos Taylor        7.9


R. Marcos Taylor has an average movie rating of 7.9. In terms of community 1, this actor has the highest global Bonacich centrality, meaning that across the graph he has influential alters.

## # A tibble: 1 x 2
##   actor      avg_rating
##   <chr>           <dbl>
## 1 Eric Sykes        7.6


Eric Sykes has an average movie rating of 7.6. In terms of community 1, this actor has the highest local Bonacich centrality, meaning that, looking only at community 1, Eric Sykes has the most influential alters.

Community 2
## # A tibble: 1 x 2
##   actor              avg_rating
##   <chr>                   <dbl>
## 1 Scarlett Johansson       7.52


Scarlett Johansson has an average movie rating of 7.522159. In terms of community 2, this actor has the highest degree.

## # A tibble: 1 x 2
##   actor                avg_rating
##   <chr>                     <dbl>
## 1 Kristin Scott Thomas       6.94


Kristin Scott Thomas has an average movie rating of 6.939583. In terms of community 2, this actor has the highest global closeness centrality meaning she is very central overall in the graph but not the most central if only looking at community 2.

## # A tibble: 1 x 2
##   actor          avg_rating
##   <chr>               <dbl>
## 1 Rachael Harris       6.21


Rachael Harris has an average movie rating of 6.208333. In terms of community 2, this actor has the highest local closeness centrality meaning she is very central within community 2 but not overall in the graph.

## # A tibble: 1 x 2
##   actor        avg_rating
##   <chr>             <dbl>
## 1 Steve Coogan       6.29


Steve Coogan has an average movie rating of 6.2875. In terms of community 2, this actor has the highest global page rank.

## # A tibble: 1 x 2
##   actor          avg_rating
##   <chr>               <dbl>
## 1 Richard Schiff       6.14


Richard Schiff has an average movie rating of 6.143333. In terms of community 2, this actor has the highest local page rank.

## # A tibble: 1 x 2
##   actor        avg_rating
##   <chr>             <dbl>
## 1 Gary Coleman       6.15


Gary Coleman has an average movie rating of 6.15. In terms of community 2, this actor has the highest global Bonacich centrality and has influential alters across the network.

## # A tibble: 1 x 2
##   actor           avg_rating
##   <chr>                <dbl>
## 1 Pamela Anderson        5.5


Pamela Anderson has an average movie rating of 5.5. In terms of community 2, this actor has the highest local Bonacich centrality and has influential alters within community 2.

Community 3

## # A tibble: 1 x 2
##   actor          avg_rating
##   <chr>               <dbl>
## 1 Morgan Freeman       7.76


Morgan Freeman has an average movie rating of 7.7605. In terms of community 3, this actor has the highest degree, closeness centrality and page rank in terms of both local and global calculations.

## # A tibble: 1 x 2
##   actor avg_rating
##   <chr>      <dbl>
## 1 Tabu         7.8


Tabu acts primarily in Hindi films and is the only highlighted actor not from Western films. The average movie rating is 7.8 and, in terms of community 3, Tabu has the highest global Bonacich centrality.

## # A tibble: 1 x 2
##   actor           avg_rating
##   <chr>                <dbl>
## 1 Charlize Theron       6.59


Charlize Theron has an average movie rating of 6.586667 and has the highest local Bonacich power centrality within community 3.

Centrality measures and ratings


None of the graphs show any strong correlation between global centrality and the average movie rating. We now explore whether local centrality measures produce a different outcome.


The local centralities do not appear to have any correlation to the average movie rating.
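The correlation check behind these plots can be sketched with a Pearson coefficient in plain Python (the actor data below is hypothetical; the report's values come from its R pipeline):

```python
def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length series:
    # covariance divided by the product of standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical: centrality scores vs average ratings for five actors
centrality = [0.1, 0.4, 0.2, 0.9, 0.5]
avg_rating = [6.8, 7.1, 5.9, 6.5, 7.4]
r = pearson(centrality, avg_rating)
print(r)  # near zero: no clear linear relationship, as in the plots
```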

## # A tibble: 3 x 7
##    comm avg_rating  avg_e     avg_c_c  avg_p_r avg_p_c   num
##   <int>      <dbl>  <dbl>       <dbl>    <dbl>   <dbl> <int>
## 1     1       6.26 -0.225 0.000000142 0.000122  0.0292   535
## 2     2       6.53  1.01  0.000000142 0.000219 -0.0287   462
## 3     3       6.45  1.69  0.000000142 0.000185 -0.0677   446


Across the three graphs, the mean centrality scores remain roughly constant across communities, with the exception of the eigenvector centrality of community 1.

Concluding remarks

It can be said that the centrality of nodes is not an indicator of success for movie ratings: the variance in movie ratings is relatively high both for very central nodes and for less central ones. There is also doubt as to the reliability of the communities due to the densely connected nature of the graph.

Clarice, Daven, Lucia, Christopher and Indurain

September 6, 2019